[backend] Fix strict_dynamic_mapping_exception exceptions thrown in fileIndexManager (#14665) #14655

Open
fellowseb wants to merge 8 commits into master from
issue/89-strict-dynamic-mapping-exception-in-fileindexmanager
Conversation

@fellowseb
Member

@fellowseb fellowseb commented Feb 26, 2026

Context

The issue arose because of missing index mappings in the attachment sub-document: we use an Elasticsearch pipeline processor for attachments that extracts fields for us. By default this processor extracts all the fields it can: https://www.elastic.co/guide/en/elasticsearch/reference/8.19/attachment.html#attachment-fields.

The problem is that we've created index mappings for only a subset of those fields (see document.ts). Combined with the fact that we enforce dynamic: strict on indices, meaning we don't let unknown fields be pushed to an index, this resulted in a few exceptions.

Proposed changes

  • This PR specifies which fields to extract when configuring the attachment pipeline in ES/OS: those for which we already have a mapping.
    We could consider ingesting the other pieces of data but I'm not sure it's useful and the volume is ultra low for now.

  • I added an integration test making sure that indexing doesn't fail for a PDF carrying metadata (dc:publisher or dc:rating) that the processor could extract but no longer does, because we now tell it not to.

  • I ran the test with ES and OS locally. Running with OS required tweaking the dev setup to install the ingest-attachment plugin before starting the OS process which requires a custom Dockerfile (https://docs.opensearch.org/latest/install-and-configure/install-opensearch/docker/#working-with-plugins).
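
The first bullet above (restricting what the attachment processor extracts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the allowlist contents, the `file_data` source field, and the `buildAttachmentPipeline` helper are hypothetical. The `field`, `remove_binary`, and `properties` options are those of the ingest attachment processor.

```typescript
// Hypothetical allowlist: only the properties that have an index mapping.
const MAPPED_ATTACHMENT_PROPS = ['content', 'title', 'content_type', 'content_length'] as const;

type AttachmentProp = (typeof MAPPED_ATTACHMENT_PROPS)[number];

interface AttachmentProcessor {
  attachment: {
    field: string; // document field holding the base64-encoded file
    remove_binary: boolean; // drop the binary once extraction is done
    properties: AttachmentProp[]; // extract these properties and nothing else
  };
}

interface IngestPipelineBody {
  description: string;
  processors: AttachmentProcessor[];
}

// Builds the pipeline body; sending it to ES/OS is left to the caller.
function buildAttachmentPipeline(sourceField: string): IngestPipelineBody {
  return {
    description: 'Extract only the attachment fields that have a mapping',
    processors: [
      {
        attachment: {
          field: sourceField,
          remove_binary: true,
          properties: [...MAPPED_ATTACHMENT_PROPS],
        },
      },
    ],
  };
}

const pipeline = buildAttachmentPipeline('file_data');
console.log(JSON.stringify(pipeline, null, 2));
```

With an explicit `properties` list, metadata such as dc:publisher is never written to the document, so an index with dynamic: strict has nothing unknown to reject.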

Related issues

Checklist

  • I consider the submitted work as finished
  • I tested the code for its functionality
  • I wrote test cases for the relevant use cases (coverage and e2e)
  • I added/updated the relevant documentation (either on GitHub or on Notion)
  • Where necessary I refactored code to improve the overall quality

Further comments

I had to track down how to name the metadata by looking at the ES code (https://github.com/elastic/elasticsearch/blob/main/modules/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java#L200) and at the library it uses (Apache Tika). Without the dc: prefix, the metadata wouldn't be seen by the processor.

I used https://www.embedpdf.com/tools/pdf-metadata-editor to add metadata to the test file, and Tika to read it back, as the ES dependency does.

@github-actions github-actions bot added the filigran team use to identify PR from the Filigran team label Feb 26, 2026
@codecov

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 98.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 30.86%. Comparing base (c1b2181) to head (bf5d902).

Files with missing lines Patch % Lines
...ti-platform/opencti-graphql/src/database/engine.ts 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #14655      +/-   ##
==========================================
- Coverage   32.36%   30.86%   -1.51%     
==========================================
  Files        3097     3099       +2     
  Lines      210976   211013      +37     
  Branches    38232    37574     -658     
==========================================
- Hits        68280    65124    -3156     
- Misses     142696   145889    +3193     
Flag Coverage Δ
opencti-client-python 45.48% <ø> (ø)
opencti-front 2.83% <ø> (ø)
opencti-graphql 64.14% <98.00%> (-3.59%) ⬇️

@fellowseb fellowseb force-pushed the issue/89-strict-dynamic-mapping-exception-in-fileindexmanager branch from 698cd42 to 2ea4f68 Compare February 26, 2026 20:33
@fellowseb fellowseb marked this pull request as ready for review February 26, 2026 21:05
@fellowseb fellowseb self-assigned this Feb 26, 2026
@fellowseb fellowseb force-pushed the issue/89-strict-dynamic-mapping-exception-in-fileindexmanager branch from 3635ce9 to c4b70ac Compare February 26, 2026 21:16
@aHenryJard
Member

Can you please create a public issue in OpenCTI repo and link it https://github.com/OpenCTI-Platform/opencti/issues

@fellowseb fellowseb force-pushed the issue/89-strict-dynamic-mapping-exception-in-fileindexmanager branch from 1f89b8f to c97d2f2 Compare February 27, 2026 08:51
@fellowseb fellowseb changed the title [backend] Fix strict_dynamic_mapping_exception exceptions thrown in fileIndexManager (#89) [backend] Fix strict_dynamic_mapping_exception exceptions thrown in fileIndexManager (#14665) Feb 27, 2026
@fellowseb
Member Author

Can you please create a public issue in OpenCTI repo and link it https://github.com/OpenCTI-Platform/opencti/issues

Done. I'll make sure the correct issue number is in the commit message when squashing too 👍.

@fellowseb fellowseb force-pushed the issue/89-strict-dynamic-mapping-exception-in-fileindexmanager branch from c97d2f2 to f8aaca1 Compare February 27, 2026 10:39
typeof ATTACHMENT_MAPPINGS[number]['name']
> extends never
? MappingDefinition<BasicStoreAttribute>[]
: 'Make sure ATTACHMENT_MAPPINGS defines one mapping for each AttachmentProcessorExtractedProp';
Member

I don't fully understand this type check. Could you explain it, please? Why is it needed?

Member Author

@fellowseb fellowseb Feb 28, 2026

So basically, the exception came from not having defined a mapping for each field extracted from the document.
The solution I suggest is to configure the attachment ingest processor to extract only the fields we want.

I wanted a way to statically make sure those two lists (the mappings and the ingest process config) remain in sync so I came up with this "hack" that uses an intermediate variable TYPE_CHECKED_ATTACHMENT_MAPPINGS on which we assert one of two types:

  • either its correct type MappingDefinition<BasicStoreAttribute>[], if no mappings are missing (the never case)
  • or a string literal that actually explains the problem
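
The conditional-type check described above can be sketched roughly as follows. This is an illustration only: the property list, the simplified `MappingDefinition` interface, and the mapping entries are assumptions. Only the names `ATTACHMENT_MAPPINGS` and `TYPE_CHECKED_ATTACHMENT_MAPPINGS` come from this discussion.

```typescript
// Assumed allowlist of properties the attachment processor extracts.
const EXTRACTED_PROPS = ['content', 'title', 'content_type'] as const;
type ExtractedProp = (typeof EXTRACTED_PROPS)[number];

// Simplified stand-in for the real mapping definition type.
interface MappingDefinition {
  name: string;
  type: string;
}

const ATTACHMENT_MAPPINGS = [
  { name: 'content', type: 'text' },
  { name: 'title', type: 'text' },
  { name: 'content_type', type: 'keyword' },
] as const;

type MappedName = (typeof ATTACHMENT_MAPPINGS)[number]['name'];

// Extracted props that have no mapping; `never` when nothing is missing.
type MissingMapping = Exclude<ExtractedProp, MappedName>;

// Either the real type (the `never` case) or a string literal describing the problem.
type CheckedMappings = [MissingMapping] extends [never]
  ? readonly MappingDefinition[]
  : 'Make sure ATTACHMENT_MAPPINGS defines one mapping for each extracted property';

// Fails to compile if a mapping is missing: an array is not assignable to the string literal.
const TYPE_CHECKED_ATTACHMENT_MAPPINGS: CheckedMappings = ATTACHMENT_MAPPINGS;

console.log(TYPE_CHECKED_ATTACHMENT_MAPPINGS.map((mapping) => mapping.name));
```

Removing one entry from `ATTACHMENT_MAPPINGS` in this sketch makes the last assignment a compile error whose message contains the string literal, which is what makes the failure self-explanatory.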

While this was fun, it's maybe not the clearest way to do it (:, so I pushed another refactor commit that creates an assertType utility. It's a no-op at runtime but is useful for strict compile-time type-equality checks. Let me know if that's better! With that approach I can keep the mapping definitions in place.
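
A utility of this kind could look something like the sketch below. The signature of `assertType` and the `Equals` helper are assumptions (the PR's real implementation is not shown in this thread), and the two small lists are placeholder data.

```typescript
// Strict type equality: resolves to `true` only when A and B are identical types.
type Equals<A, B> =
  (<T>() => T extends A ? 1 : 2) extends (<T>() => T extends B ? 1 : 2) ? true : false;

// No-op at runtime; compiles only when the type argument is `true`.
function assertType<_Check extends true>(): void {}

// Placeholder data standing in for the ingest config and the index mappings.
const EXTRACTED_PROPS = ['content', 'title'] as const;
const MAPPINGS = [{ name: 'content' }, { name: 'title' }] as const;

type ExtractedProp = (typeof EXTRACTED_PROPS)[number];
type MappedName = (typeof MAPPINGS)[number]['name'];

// Compile-time check that both lists name exactly the same properties.
assertType<Equals<ExtractedProp, MappedName>>();

console.log('mapping names and extracted props are in sync');
```

Compared with the conditional-type variable, the assertion sits next to the definitions it checks and does not require an intermediate constant.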

@fellowseb fellowseb force-pushed the issue/89-strict-dynamic-mapping-exception-in-fileindexmanager branch 2 times, most recently from 01100d7 to ea8f73f Compare February 28, 2026 10:21
@fellowseb fellowseb force-pushed the issue/89-strict-dynamic-mapping-exception-in-fileindexmanager branch from ea8f73f to bf5d902 Compare February 28, 2026 11:01
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses strict_dynamic_mapping_exception errors during file indexing by constraining which fields the Elasticsearch/OpenSearch attachment ingest processor is allowed to extract, aligning ingestion with the existing strict index mappings.

Changes:

  • Configure the attachment ingest pipeline (properties) to only extract fields that are mapped (separately for Elasticsearch vs OpenSearch).
  • Add an integration test indexing a PDF containing metadata that would previously trigger strict mapping failures.
  • Add an OpenSearch dev Docker image build (with ingest-attachment plugin) and update dependent test expectations.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
opencti-platform/opencti-graphql/tests/03-integration/04-manager/retentionManager-test.ts Updates expected file counts due to the new indexed test file.
opencti-platform/opencti-graphql/tests/03-integration/01-database/index-file-test.js Adds an integration test to validate indexing succeeds with “unhandled” PDF metadata.
opencti-platform/opencti-graphql/src/utils/type-utils.ts Adds TS utility types/helpers used for compile-time type assertions.
opencti-platform/opencti-graphql/src/modules/internal/document/document.ts Adds a compile-time check to keep attachment mappings aligned with extracted props.
opencti-platform/opencti-graphql/src/database/engine.ts Restricts ingest-attachment extracted properties for ES/OS pipelines.
opencti-platform/opencti-graphql/src/database/attachment-processor-props.ts Defines the explicit extracted-property allowlists (ES vs OpenSearch) + shared union type.
opencti-platform/opencti-dev/opensearch/Dockerfile Builds an OpenSearch image with the ingest-attachment plugin installed.
opencti-platform/opencti-dev/docker-compose.yml Switches OpenSearch service to build: the new Dockerfile and updates usage hint.

Comment on lines +100 to +103
const attachmentAttributes = attributes[18];

type AttachmentAttributeMappingNames = typeof attachmentAttributes.mappings[number]['name'][];

Copilot AI Feb 28, 2026

attributes[18] is a fragile way to reference the attachment attribute. If the attributes array order changes, this will silently point at the wrong element and can cause a runtime crash when accessing .mappings (or make the type assertion validate the wrong mapping). Prefer selecting the attachment definition by name === 'attachment' (and failing fast if not found) to make this resilient to reordering.

Suggested change
- const attachmentAttributes = attributes[18];
- type AttachmentAttributeMappingNames = typeof attachmentAttributes.mappings[number]['name'][];
+ const attachmentAttributes = attributes.find((attribute) => attribute.name === 'attachment');
+ if (!attachmentAttributes || !attachmentAttributes.mappings) {
+   throw new Error('Attachment attribute definition with mappings not found in internal file attributes');
+ }
+ type AttachmentAttributeMappingNames = (typeof attachmentAttributes.mappings)[number]['name'][];

Comment on lines 313 to 315
remove_binary: true,
properties: ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH as Mutable<typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH>,
},
Copilot AI Feb 28, 2026

The properties value is being forced to a mutable array via a type cast. Instead of as Mutable<...>, consider creating a real mutable array (e.g., by spreading into a new array) so the value’s runtime shape matches the declared type and you can avoid unsafe assertions.
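
As a rough sketch of the spread-based alternative (the constant name, its contents, and the `AttachmentOptions` interface are placeholders, not the PR's code): `as const` only makes the array read-only at the type level, so a cast would hand out the very same array object, while a spread creates an independent copy.

```typescript
// Placeholder allowlist; `as const` is a compile-time guarantee only, the array is not frozen.
const EXTRACTED_PROPS = ['content', 'title', 'content_type'] as const;

interface AttachmentOptions {
  remove_binary: boolean;
  properties: string[]; // mutable array expected by the consumer
}

const options: AttachmentOptions = {
  remove_binary: true,
  // Spreading yields a genuinely mutable copy, so no type assertion is needed.
  properties: [...EXTRACTED_PROPS],
};

// Mutating the copy leaves the shared constant untouched.
options.properties.push('language');
console.log(EXTRACTED_PROPS.length, options.properties.length); // prints: 3 4
```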

Comment on lines +53 to +55
// Union type of all properties extracted by the ES or OS attachment processor
export type AttachmentProcessorExtractedProp = Mutable<typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH>[number]
| Mutable<typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_OPENSEARCH>[number];
Copilot AI Feb 28, 2026

Using Mutable<typeof ...>[number] to extract the union of literal values is unnecessarily complex for const arrays/tuples. typeof ATTACHMENT_PROCESSOR_EXTRACTED_PROPS_ELASTICSEARCH[number] (and similarly for OpenSearch) should produce the same union type and makes the intent clearer.
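
The equivalence claimed here can be checked with a small sketch. The `Mutable` definition below is an assumption about how the PR's helper is written, and the two lists are placeholders rather than the real allowlists.

```typescript
// Assumed shape of the Mutable helper: strips `readonly` from every member.
type Mutable<T> = { -readonly [K in keyof T]: T[K] };

// Placeholder allowlists for the two engines.
const ES_PROPS = ['content', 'title'] as const;
const OS_PROPS = ['content', 'language'] as const;

type ViaMutable = Mutable<typeof ES_PROPS>[number] | Mutable<typeof OS_PROPS>[number];
type Direct = (typeof ES_PROPS)[number] | (typeof OS_PROPS)[number];

// Strict type equality helper.
type Equals<A, B> =
  (<T>() => T extends A ? 1 : 2) extends (<T>() => T extends B ? 1 : 2) ? true : false;

// Compiles only if both spellings produce the same union.
const sameUnion: Equals<ViaMutable, Direct> = true;

// Runtime counterpart: the de-duplicated union of both lists.
const ALL_PROPS: Direct[] = [...ES_PROPS, ...OS_PROPS].filter(
  (prop, index, all) => all.indexOf(prop) === index,
);
console.log(sameUnion, ALL_PROPS);
```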


# Disabled by default, to run use:
# docker compose --profile opensearch up -d
# podman compose --profile opensearch up -d
Copilot AI Feb 28, 2026

The usage hint here was changed to podman compose, but the rest of this file still documents docker compose for other profiles. If Podman is not required, keep docker compose (or mention both) to avoid confusing contributors following the dev setup instructions.

Suggested change
- # podman compose --profile opensearch up -d
+ # docker compose --profile opensearch up -d
+ # or: podman compose --profile opensearch up -d

Labels

filigran team use to identify PR from the Filigran team

4 participants